- Path: beta.nedernet.nl!usenet
- From: jos@and.nl (Jos A. Horsmeier)
- Newsgroups: comp.misc,comp.unix.programmer,comp.lang.c,comp.programming
- Subject: Re: Full Text Search Algorithms
- Date: 17 Apr 1996 10:19:47 GMT
- Organization: AND Operations Research B.V.
- Message-ID: <4l2gk3$lgd@beta.nedernet.nl>
- References: <3174436E.1A36@flashnet.it>
- NNTP-Posting-Host: klepzeiker.and.nl
- Mime-Version: 1.0
- Content-Type: Text/Plain; charset=ISO-8859-1
- X-Newsreader: WinVN 0.99.5
-
- In article <3174436E.1A36@flashnet.it>, intarget@flashnet.it wrote:
-
- |I need to build a full text search engine which can handle a large amount
- |of documents (about 300,000 plain text documents 3Kbytes long).
- |I tried implementing inverted lists of words with B-tree indexes,
- |but the resulting software becomes too slow after the first 20,000
- |documents. In addition, there's too much wasted space in the indexes.
- |
- |Can anybody suggest a better algorithm and, if possible, tell me where
- |I can find a technical description of it (possibly on the NET)?
-
- Maybe the following approach will help you out:
-
- - construct a list of unique words such that if a word is an element
- of this list, then this word occurs at least once in at least one
- document.
-
- - for each word in the list above, construct a bit list such that
- if a bit x in this list is set, then the word occurs in document
- number x.
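The two steps above can be sketched in Python, with each bit list stored as a
plain integer whose bit x is set when the word occurs in document number x
(the names `build_index` and `docs` are illustrative, not from the original post):

```python
# Sketch: build an inverted bitmap index. Document numbers are simply the
# positions of the documents in the `docs` list.

def build_index(docs):
    index = {}  # word -> bit list (an int; bit x set => word occurs in doc x)
    for doc_no, text in enumerate(docs):
        for word in set(text.lower().split()):
            index[word] = index.get(word, 0) | (1 << doc_no)
    return index

docs = ["the cat sat", "the dog ran", "cat and dog"]
index = build_index(docs)

# A query for documents containing both 'cat' and 'dog' is just a
# bitwise AND of the two bit lists:
hits = index["cat"] & index["dog"]
matching = [i for i in range(len(docs)) if (hits >> i) & 1]
```

Conjunctive queries over any number of words stay a chain of ANDs, which is
why this representation scales better than walking B-tree postings per word.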
-
- Words like 'the', 'and' etc. most likely have all their bits set
- in their bit lists and may be removed from the first list. Compressing
- the bit lists further greatly reduces the total size of the database;
- simple tricks like run length encoding come to mind ...
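A minimal run-length encoding sketch, assuming for illustration that each bit
list is held as a string of '0'/'1' characters (sparse or near-full bit lists,
the common cases here, collapse to a handful of runs):

```python
from itertools import groupby

def rle(bits):
    # Encode a bit string as (bit, run_length) pairs.
    return [(b, len(list(g))) for b, g in groupby(bits)]

def unrle(pairs):
    # Decode back to the original bit string.
    return "".join(b * n for b, n in pairs)

bits = "0" * 6 + "1" * 5 + "0" * 9 + "1"
encoded = rle(bits)          # [('0', 6), ('1', 5), ('0', 9), ('1', 1)]
assert unrle(encoded) == bits
```

With 300,000 documents a raw bit list is 300,000 bits (~37 KB) per word, so
run compression of the long zero runs in rare words pays off quickly.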
-
- kind regards,
-
- Jos aka jos@and.nl
- --
- Atnwgqkrl gy zit vgksr, ug qshiqwtzoeqs!